Introduction: Simple linear regression models are essential tools to examine associations between variables. These models provide estimates of the rate of change in the response variable (i.e., y-axis) given a change in the explanatory variable (x-axis). Because many biological variables are continuous and usually follow a normal distribution, simple linear regression models are fundamental tools to describe the relationship between variables of interest.
In this module, you will continue to define variables of interest and estimate their descriptive statistics (M.1 and M.2) while learning to fit simple linear regression models to test associations between fertility rate and age in female rhesus macaques.
Upon completion of this module, you will be able to:
References:
Extra training:
Associated literature:
Functions:
Base R:
ggplot2:
tidyverse:
Rhesus macaques are an informative comparative model due to several key demographic traits shared with many long-lived mammals, including humans. Long-lived mammals often show reproductive senescence (decreased in reproduction with age). However, the relationship of fertility rate as a function of age may be complex, showing periods of optimal (prime) reproduction, periods of stability, and other periods of decline. Describing these patterns is important to understand changes in mean fertility at the population-level, but also why some individual adult females (Fig 1) may not reproduce in a particular birth season.
Fig 1. Cayo Santiago rhesus macaque adult females with and infant. Photo by Raisa Hernández-Pacheco.
The Cayo Santiago rhesus macaque longitudinal demographic data on all individuals provides the unique opportunity to study important demographic rates such as age-specific fertility across the entire lifespan. In this module, you will be using Cayo Santiago rhesus macaque female reproductive data from Luevano et al 2022 (cayo_fert, Table 1) to explore whether female fertility is a function of age using simple linear regression analysis. This is authentic data collected through daily visual censuses performed by the staff of the Caribbean Primate Research Center (CPRC) of the University of Puerto Rico-Medical Sciences Campus.
Before coding, explore the data in Table 1.
Metadata of cayo_fert: Demographic data of Cayo Santiago female rhesus macaques from Luevano et al. 2022.
Before any analysis, check the data set and understand its attributes. Refer to Module M.1 for data structure exploration.
Guiding questions:
Yes!
Click for
Answer!
Each female was followed across their entire life in the island
(i.e., until they exited the population through death, culling, or right
censorship). Thus, the data provides information of the reproductive
performance of females per year of their life.
Click for
Answer!
Because the data provides information on the reproductive performance of females per year of their life, you can estimate the mean age-specific fertility rate (number of offspring produced at age x divided by the total number of females of age x).
Below, you will learn how to! You will use tidyverse and rstatix to summarize age-specific fertility rate (refer to Module M.2).
# loading packages
library(tidyverse)
library(rstatix)
# summary table with age-specific fertility
fert <- cayo_fert %>%
group_by(age) %>% # grouping by age
get_summary_stats(offspring) #summary of number of offspring
fert
Guiding questions:
Fertility increases with age up to a maximum value at age ~9 and then
decreases until age 24.
Click for
Answer!
Challenge!
Guiding questions:
The relationship between mean fertility and age is nonlinear. A few
females start reproducing at age 3, most females reproduce at ages 4 to
15, and then fertility decreases. After 24 years of age, no female
reproduces.
Click for
Answer!
Yes. A linear model would describe the negative association between
mean fertility and age.
Click for
Answer!
Below, you will fit a simple linear regression to the rhesus macaque female fertility data. For this, you will use the command lm(), where the first argument is the response variable, followed by the explanatory variable (separated by ~). To see the model output, you will use summary(). Make sure the variables are of class numeric!
# linear model
lm1 <- lm(mean ~ age, data = fert)
# model output
summary(lm1)
Guiding questions:
Call: the model formula defined.
Residuals: Minimum, 1st quartile, median, 3rd quartile, and
maximum value of the model residuals. As a refresher, the residuals are
the difference between the observed value of the response variable
(y-axis) and the predicted value from the regression line.
Coefficients: Model parameters (y-intercept and
slope) and statistics for each parameter (SE, t statistic, p-value).
Residual standard error: The standard deviation of the
residuals (smaller residual SE means model predictions are better)
R-squared: Coefficient of determination and indicates the
proportion of the variation in the dependent variable that is
predictable from the independent variable.
Click for
Answer!
y = -0.019259*age + 0.821764. Mean fertility at age 0 is
predicted to be 0.822. Per an increase in 1 year of age, mean fertility
decreases by 0.019.
Click for
Answer!
There are several ways to plot model predictions in R. First, you will implement the function predict() to extract model predictions (values of the linear regression). Second, you will implement the function geom_smooth(methods = “lm”) from ggplot2 to automatically plot the predicted values given the empirical data.
# plotting empirical and predicted values
p1 <- ggplot(fert,aes(age,mean)) +
geom_point() +
geom_line(aes(age,predict(lm1))) #adding new data to geom_line()
p1
# using geom_smooth for model predictions
p2 <- ggplot(fert,aes(age,mean)) +
geom_point() +
geom_smooth(method = "lm") #with SE
p2
Guiding questions:
As expected for a long-lived mammal, the mean fertility decreases with age. However, you can argue that a straight line (single slope) is not the best model to describe the observed association. According to the data, there is a maximum mean fertility at an optimal age. Below, you will fit the same simple linear regression model but now adding a quadratic term for age using the function I().
# linear model with quadratic term
lm2 <- lm(mean ~ age + I(age^2), data = fert)
# model output
summary(lm2)
As above, you can use the same functions to extract and plot model predictions.
# plotting empirical and predicted values
p3 <- ggplot(fert,aes(age,mean)) +
geom_point() +
geom_line(aes(age,predict(lm2))) #adding new data to geom_line()
p3
# using geom_smooth for model predictions
p4 <- ggplot(fert,aes(age,mean)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x + I(x^2)) #specifying the quadratic formula
p4
Challenge!
Improve these plots (refer to Module M.3).
Interpret model output and define the regression.
Question: Using your statistical knowledge and each model output, which model seems to best fit the data and why, lm1 or lm2?
Acknowledgements: The creation of this module was
funded by the National Science Foundation DBI BRC-BIO award 2217812.
Cayo Santiago is supported by the Office of Research Infrastructure
Programs (ORIP) of the National Institutes of Health, grant 2 P40
OD012217, and the University of Puerto Rico (UPR), Medical Sciences
Campus.
Authors: Raisa Hernández-Pacheco and Alexandra L. Bland, California State University, Long Beach